Abstract

In latent Dirichlet allocation (LDA), topics are multinomial distributions
over the entire vocabulary. However, the vocabulary usually contains many
words that are not relevant in forming the topics. We adopt a variable
selection method widely used in statistical modeling as a dimension reduction
tool and combine it with LDA. In this variable selection model for LDA
(vsLDA), topics are multinomial distributions over a subset of the vocabulary,
and by excluding words that are not informative for finding the latent topic
structure of the corpus, vsLDA finds topics that are more robust and
discriminative. We compare three models, vsLDA, LDA with symmetric priors,
and LDA with asymmetric priors, on heldout likelihood, MCMC chain
consistency, and document classification. The performance of vsLDA is better
than symmetric LDA for likelihood and classification, better than asymmetric
LDA for consistency and classification, and about the same in the other
comparisons.

1 Introduction

Latent Dirichlet allocation (LDA) [2], a widely used topic model, decomposes a

corpus into a finite set of topics. Each topic is a multinomial distribution over 
the entire vocabulary, which is typically defined to be the set of all unique words 
with an optional step of removing stopwords and high frequency words. Even 
with the preprocessing step, the vocabulary will almost certainly contain words 
that do not contribute to the underlying topical structure of the corpus, and 
those words may interfere with the model's ability to find topics with predictive 
and discriminative power. More importantly, one cannot be sure whether and 
how much the vocabulary influences the topics inferred, and there is not a 
systematic way to compare different vocabularies for a given corpus. We relax 
the constraint that the vocabulary must be fixed a priori and let the topic model 
consider any subset of the vocabulary for representing the topics. 

We propose a model-based variable selection [9] for LDA (vsLDA) that
combines the process of identifying a relevant subset of the vocabulary with
the process of finding the topics. Variable selection has not been studied in
depth for LDA or any other topic model, but three models, HMM-LDA [7], 
sparseTM [22], and SWB [4] achieve a similar effect of representing the topics 
with a subset of the vocabulary. HMM-LDA [7] models the short- and long-range 
dependencies of the words and thus identifies whether words are generated from 
the syntactic (non-topic) or the semantic (topic) class. SparseTM [22] aims to 
decouple sparsity and smoothness of the word-topic distribution and thereby 
excludes some words from each topic. SWB [4] separates word tokens into
general and specific aspects, and it is probably the most similar work to ours
in that it also globally excludes words from forming the topics. However, SWB
excludes word tokens, whereas vsLDA excludes word types. By looking at the
word types, we can replace the necessary but arbitrary step of deciding the
vocabulary for forming the topics, which usually includes removing
uninformative words. Such a process typically relies on a list of stop words
and on corpus-dependent infrequent and highly frequent words, and in this work
we show that such preprocessing is inadequate as a form of variable selection.
We can also view this problem of variable selection as a type of model
selection along the vocabulary dimension. Model selection has been well
studied for the topic dimension with nonparametric topic models [19, 22] but
not for the vocabulary dimension.

This paper is organized as follows. We first describe our vsLDA model for 
selecting informative words. We derive an approximate algorithm for posterior 
inference on the latent variables of interest in vsLDA based on Markov Chain 
Monte Carlo and Monte Carlo integration. We demonstrate our approach on 
a synthetic dataset to verify the correctness of our model. Then we run our 
model on three real-world datasets and compare the performance with LDA 
with symmetric priors (symLDA) and LDA with asymmetric priors (asymLDA). 
We show that vsLDA finds topics with better predictive power than symLDA 
and more robustness than asymLDA. We also find that vsLDA reduces each 
document into more discriminating subdimensions and hence outperforms the 
other models for document classification. 

2 Variable Selection for LDA (vsLDA) 

LDA is typically used with a preprocessing step of removing stopwords and 
the words that occur frequently throughout the corpus. The rationale is that 
the words pervading the corpus do not contribute to but hinder the process of 
discovering a latent topic structure. This frequency-based preprocessing step 
excludes the words a priori independent of constructing the latent topics. How- 
ever, we cannot be certain whether the excluded words are truly non-informative 
for topic construction. Also, the same uncertainty applies to the included words. 
Here, we propose a new LDA model where the word selection is conducted simul- 
taneously while discovering the latent topics. The proposed approach combines 
a stochastic search variable selection [5] with LDA, providing an automatic word 
selection procedure for topic models.

Suppose we have a vocabulary of size $V$, with or without any preprocessing.
In a typical topic model, topics are defined on the entire vocabulary and
assumed to be Dirichlet-distributed on the $(V-1)$-simplex, i.e.,

$$\phi_k \sim \mathrm{Dir}(\beta \mathbf{1}), \quad k \in \{1, 2, 3, \ldots, K\}, \tag{1}$$

where $\mathbf{1}$ is a $V$-dimensional vector of 1s and $K$ is the number of topics. Our
assumption is that the vocabulary is divided into two mutually exclusive word
sets: one includes informative words for constructing topics, and the other
contains non-informative words. The topics are assumed to be defined only
on the informative word set and distributed as

$$\phi_k \sim \mathrm{Dir}(\beta s), \quad k \in \{1, 2, 3, \ldots, K\}, \tag{2}$$

where $s = (s_1, \ldots, s_V)$ and $s_j$ is an indicator variable defined as

$$s_j = \begin{cases} 1, & \text{word } j \text{ is an informative word,} \\ 0, & \text{word } j \text{ is a non-informative word.} \end{cases} \tag{3}$$

In other words, $s$ specifies a smaller simplex with dimension $\sum_{j=1}^{V} s_j - 1$ for
the informative word set. Not knowing a priori whether a word is informative
or non-informative, we assume $s_j \sim \mathrm{Bernoulli}(\lambda)$ to incorporate uncertainty in
the informativity of words.

Now, we describe the generative process for vsLDA, which includes the steps
for dividing the entire vocabulary into an informative word set and a
non-informative word set (step 1) and determining the membership of a word
token either as one of the topics or as the non-informative word set
(step 4(b)).

1. For each word $j \in \{1, 2, \ldots, V\}$, draw word selection variable $s_j \sim \mathrm{Bernoulli}(\lambda)$

2. For each topic $k \in \{1, 2, \ldots, K\}$, draw topic distribution $\phi_k \sim \mathrm{Dir}(\beta s)$

3. For the non-informative word set, draw word distribution $\psi \sim \mathrm{Dir}(\gamma s^c)$

4. For each document $d \in \{1, 2, \ldots, D\}$:

(a) Draw topic proportion $\theta_d \sim \mathrm{Dir}(\alpha)$

(b) For the $i$th word token, draw $b_{di} \sim \mathrm{Bernoulli}(\tau)$:

i. If $b_{di} = 1$:

A. Draw topic assignment $z_{di} \sim \mathrm{Mult}(\theta_d)$

B. Draw word token $w_{di} \sim \mathrm{Mult}(\phi_{z_{di}})$

ii. else

A. Draw word token $w_{di} \sim \mathrm{Mult}(\psi)$
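
The generative steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function name `generate_vslda_corpus`, the fixed document length `N_d`, the symmetric scalar `alpha`, the `seed` argument, and the guard that keeps both word sets non-empty are assumptions made only for this sketch (it also assumes `V >= 2`).

```python
import numpy as np

def generate_vslda_corpus(V, K, D, N_d, alpha, beta, gamma, lam, tau, seed=0):
    """Sample a toy corpus by following generative steps 1-4 (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: word selection indicators s_j ~ Bernoulli(lambda).
    s = rng.random(V) < lam
    # Guard (assumption of this sketch): keep both word sets non-empty.
    if s.all():
        s[-1] = False
    if not s.any():
        s[0] = True
    informative = np.flatnonzero(s)
    non_informative = np.flatnonzero(~s)
    # Step 2: each topic is a Dirichlet draw supported on informative words only.
    phi = np.zeros((K, V))
    phi[:, informative] = rng.dirichlet(np.full(informative.size, beta), size=K)
    # Step 3: one shared distribution over the non-informative words.
    psi = np.zeros(V)
    psi[non_informative] = rng.dirichlet(np.full(non_informative.size, gamma))
    docs = []
    for _ in range(D):
        theta = rng.dirichlet(np.full(K, alpha))      # step 4(a)
        words = []
        for _ in range(N_d):
            if rng.random() < tau:                    # step 4(b): b_di = 1
                z = rng.choice(K, p=theta)            # topic assignment
                words.append(rng.choice(V, p=phi[z])) # informative word token
            else:                                     # b_di = 0
                words.append(rng.choice(V, p=psi))    # non-informative word token
        docs.append(np.array(words))
    return s, phi, psi, docs
```

Because every topic places zero mass outside the informative set and `psi` places zero mass inside it, the indicator of each token is recoverable from the word type alone, which is the property the inference procedure relies on.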
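From the generative process for a corpus, the likelihood of the corpus is

$$
\begin{aligned}
p(W, \Theta, \Phi, \psi, z, b, s \mid \alpha, \beta, \gamma, \lambda, \tau)
= {} & \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{d=1}^{D} \prod_{\{i : b_{di} = 1\}} p(w_{di} \mid \phi_{z_{di}})\, p(z_{di} \mid \theta_d) \\
& \times \prod_{d=1}^{D} \prod_{\{i : b_{di} = 0\}} p(w_{di} \mid \psi) \prod_{d=1}^{D} \prod_{i=1}^{N_d} p(b_{di} \mid \tau) \\
& \times \prod_{k=1}^{K} p(\phi_k \mid \beta, s)\; p(\psi \mid \gamma, s) \prod_{j=1}^{V} p(s_j \mid \lambda),
\end{aligned}
$$

where $N_d$ is the number of word tokens in document $d$.

Placing Dirichlet-multinomial conjugate priors over $\Theta$, $\Phi$, and $\psi$ naturally leads
to marginalizing out these variables:

$$
\begin{aligned}
p(W, z, b, s \mid \alpha, \beta, \gamma, \lambda, \tau)
= {} & \int \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{\{i : b_{di} = 1\}} p(z_{di} \mid \theta_d)\, d\Theta \\
& \times \int \prod_{k=1}^{K} p(\phi_k \mid \beta, s) \prod_{d=1}^{D} \prod_{\{i : b_{di} = 1\}} p(w_{di} \mid \phi_{z_{di}})\, d\Phi \\
& \times \int p(\psi \mid \gamma, s) \prod_{d=1}^{D} \prod_{\{i : b_{di} = 0\}} p(w_{di} \mid \psi)\, d\psi \\
& \times \prod_{j=1}^{V} p(s_j \mid \lambda) \prod_{d=1}^{D} \prod_{i=1}^{N_d} p(b_{di} \mid \tau) \\
= {} & \prod_{d=1}^{D} \frac{\Gamma(\sum_{k} \alpha_k)}{\prod_{k} \Gamma(\alpha_k)} \frac{\prod_{k} \Gamma(n^k_{d\cdot} + \alpha_k)}{\Gamma(\sum_{k} (n^k_{d\cdot} + \alpha_k))} \\
& \times \prod_{k=1}^{K} \frac{\Gamma(\beta \sum_{j} s_j)}{\prod_{\{j : s_j = 1\}} \Gamma(\beta)} \frac{\prod_{\{j : s_j = 1\}} \Gamma(n^k_{\cdot j} + \beta)}{\Gamma(n^k_{\cdot\cdot} + \beta \sum_{j} s_j)} \\
& \times \frac{\Gamma(\sum_{\{j : s_j = 0\}} \gamma_j)}{\prod_{\{j : s_j = 0\}} \Gamma(\gamma_j)} \frac{\prod_{\{j : s_j = 0\}} \Gamma(m_{\cdot j} + \gamma_j)}{\Gamma(\sum_{\{j : s_j = 0\}} (m_{\cdot j} + \gamma_j))} \\
& \times \lambda^{|s|} (1 - \lambda)^{V - |s|} \cdot \tau^{n^{\cdot}_{\cdot\cdot}} (1 - \tau)^{m_{\cdot\cdot}},
\end{aligned} \tag{4}
$$

where $n^k_{dj}$ is the number of word tokens in the $d$th document with the $j$th word
in the vocabulary assigned to the $k$th topic where $s_j = 1$, $m_{dj}$ is the number
of word tokens in the $d$th document with the $j$th word where $s_j = 0$, and
$|s| = \sum_{j} s_j$. The dots represent the marginal counts, so $m_{\cdot j}$ represents the
number of word tokens of the $j$th word across the corpus.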
Figure 1: Result of the simulation study. The first row shows the topics used
to generate the data. The second, third, and fourth rows show the inferred
topics from vsLDA, LDA10, and LDA11, respectively. For all models, we use
an asymmetric prior $\alpha$ over the per-document topic proportions.



3 Approximate Posterior Inference 

Deriving exact posterior distributions for the latent variables in vsLDA is
intractable. We propose an MCMC algorithm to obtain posterior samples in order
to make approximate inference. After marginalizing over $\Phi$, $\psi$, and $\Theta$, the
remaining latent variables in the joint likelihood in equation (4) are $z$, $b$,
and $s$. Given the word selector $s$ and the observed data $W$, $b$ is determined
because $b_{di} = 1$ for all informative word tokens and $b_{di} = 0$ for all
non-informative word tokens. Therefore, we sample only $z$ and $s$, through a
collapsed Gibbs update and a Metropolis update relying on Monte Carlo
integration, respectively.

Step 1: Sampling $z$: Given $W$ and $s$, we sample $z_{di}$ only for $d$ and $i$ such
that $w_{di} = j$ and $s_j = 1$ (i.e., only for the word tokens taking values in the
informative word set). Letting $z_{-di} = \{z_{d'i'} : d' \neq d \text{ or } i' \neq i\}$, the conditional
distribution of $z_{di}$ given $z_{-di}$, $s$, and $W$ is

$$p(z_{di} = k \mid W, z_{-di}, s) \propto (n^k_{d\cdot} + \alpha_k)\, \frac{n^k_{\cdot j} + \beta}{n^k_{\cdot\cdot} + \beta \sum_{j'=1}^{V} s_{j'}}, \tag{5}$$

where the counts are computed from $z_{-di}$. This distribution depends only on
the number of informative words and the topic assignments of the other
informative word tokens. It generalizes the update step for topic assignment
in typical LDA models, where the vocabulary size is fixed at $V$, whereas it
varies as $\sum_{j'=1}^{V} s_{j'}$ in our model.
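
As a concrete reading of equation (5), the following sketch evaluates the normalized conditional for a single informative token. The function name `topic_conditional` and the array layout are assumptions made for illustration; the caller is assumed to have already removed the current token from all counts.

```python
import numpy as np

def topic_conditional(n_dk, n_kj, n_k, alpha, beta, s):
    """Normalized p(z_di = k | W, z_-di, s) for one informative token.

    n_dk : (K,) tokens in document d assigned to each topic (current token removed)
    n_kj : (K,) tokens of word j assigned to each topic (current token removed)
    n_k  : (K,) informative tokens assigned to each topic (current token removed)
    alpha: (K,) Dirichlet parameters of the document-topic proportions
    beta : scalar Dirichlet parameter of the topic-word distributions
    s    : (V,) 0/1 word selection indicators
    """
    # The effective vocabulary size sum_j s_j takes the place of V in plain LDA.
    effective_vocab = np.sum(s)
    weights = (n_dk + alpha) * (n_kj + beta) / (n_k + beta * effective_vocab)
    return weights / weights.sum()
```

Setting every entry of `s` to 1 recovers the usual collapsed Gibbs update of LDA with vocabulary size $V$.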

Step 2: Sampling $s$: We let $z^j = \{z_{di} : w_{di} = j\}$, $z^{-j} = z \setminus z^j$, and
$s^{-j} = s \setminus s_j$. We update $s_j$ using a Metropolis step where $s_j^{\text{propose}}$ is accepted
over $s_j^{\text{current}}$ with probability

$$\min\left\{1,\; \frac{\int p(W, z^j, z^{-j}, s_j^{\text{propose}}, s^{-j})\, p^*(z^j)\, dz^j}{\int p(W, z^j, z^{-j}, s_j^{\text{current}}, s^{-j})\, p^*(z^j)\, dz^j}\right\}, \tag{6}$$

where $p^*(z^j) = p(z^j \mid z^{-j}, s_j, s^{-j})$ is the conditional distribution of $z^j$ given all
the others. If the proposed or current $s_j$ is 0, $z^j$ disappears from the joint
likelihood, which becomes $p(W, z^{-j}, s_j = 0, s^{-j})$, and we do not need to
marginalize over $z^j$. If the proposed or current $s_j$ is 1, marginalization over
$z^j$ does not yield a closed form and we rely on Monte Carlo integration as
follows:

$$\int p(W, z^j, z^{-j}, s_j = 1, s^{-j})\, p^*(z^j)\, dz^j \approx \frac{1}{U} \sum_{u=1}^{U} p(W, z^{j(u)}, z^{-j}, s_j = 1, s^{-j}), \tag{7}$$

where $z^{j(u)}$ is the $u$th sample obtained from $p^*(z^j)$ as in equation (5). Once we
obtain posterior samples of $s$, inference about $s$ is done through

$$\hat{s}_j = \operatorname*{arg\,max}_{B < t \le T}\; p(s_j^{(t)} \mid W, \mathrm{Rest}^{(t)}), \tag{8}$$

where $T$ is the total number of iterations and $B$ is the burn-in count.
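
The Metropolis step can be sketched numerically as follows. Working with log joint likelihoods is an assumption of this sketch (chosen for numerical stability, not stated in the text), and the helper names `log_mean_exp` and `acceptance_probability` are hypothetical. The caller is assumed to supply the log joint for $s_j = 0$ directly and, for $s_j = 1$, the log joints evaluated at $U$ samples of $z^j$.

```python
import math

def log_mean_exp(log_values):
    """Stable log of the average of exp(log_values).

    With log_values[u] = log p(W, z^{j(u)}, z^{-j}, s_j = 1, s^{-j}),
    this is the log of the Monte Carlo estimate of the marginal.
    """
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))

def acceptance_probability(log_marginal_proposed, log_marginal_current):
    """Metropolis acceptance probability min(1, proposed / current), in log space."""
    return math.exp(min(0.0, log_marginal_proposed - log_marginal_current))
```

For a proposal that flips $s_j$ from 0 to 1, one would pass `log_mean_exp(samples)` as the proposed value and the directly evaluated log joint with $s_j = 0$ as the current value; the reverse flip swaps the two arguments.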



4 Simulation Study 

We first verify the correctness of our model with a synthetic dataset. We
generate the synthetic corpus as follows. We start with thirty-five words in
the vocabulary and design ten topics such that each topic has five topic
(informative) words with probability 0.2 each and zero probability for all
other words. Then we add a non-informative set of ten words, which do not
appear in any of the topics, with probability 0.1 each. The first row in
Figure 1 shows these hand-crafted topics. As the figure shows, there are
twenty-five informative words and ten non-informative words. Based on these
topics, we generate 200 documents, 40 to 50 tokens each, with random topic
proportions drawn from a Dirichlet distribution with a symmetric prior of 0.1.
For each document, we set $\tau$ to 0.6, which means that 60% of the word tokens
are drawn from the topics and 40% of the word tokens are drawn from the
non-informative set.

With this synthetic corpus, we trained vsLDA with ten topics and LDA with
ten (LDA10) and eleven (LDA11) topics. For the hyperparameters, we place
an asymmetric $\alpha$ prior over the document topic proportions, a symmetric $\beta$
prior over the topic-word distributions, and a symmetric $\gamma$ prior over the
non-informative word distribution. Asymmetric $\alpha$ and symmetric $\beta$ priors can
improve the performance of LDA compared to the widely used symmetric $\alpha$ and
$\beta$ priors [20]. During the inference steps, we optimized these hyperparameters
using Minka's fixed point iteration [17], except $\gamma$, which we set to 1. Finally,
we place Beta(1,1) priors over the hyperparameters $\lambda$ and $\tau$.

Figure 1 shows the topics inferred by each model. The topics from LDA10
and LDA11 look less clear than the topics from vsLDA. By design, vsLDA infers 
the non-informative words and explicitly excludes them from the topics, so the 
resulting topics distribute all of the probability over the informative words, 
thereby discovering topics with clearer patterns. In this simulation, vsLDA 
Dataset    # of docs   # of words   # of tokens   Stopwords

20NG       2,000       3,608        155,622       No
NIPS       1,740       2,613        104,069       Yes
SigGraph   783         2,808        54,804        No

Table 1: Dataset statistics. The stopwords column indicates whether stopwords
were kept (yes) or removed (no).
Figure 2: Change in the number of non-informative words as the number of
topics (K) varies.



exactly captures the top five words in each topic and correctly identifies,
using equation (8), the set of non-informative words. LDA10 finds the top five
words in each topic quite well. However, every topic identified by LDA10
distributes some probability over the non-informative words, so the topics are
not clearly defined by the five topic words. One interesting point is that
LDA with asymmetric $\alpha$ priors is known to capture the common words of the
corpus in a single topic, so we expected that LDA11 would capture the
non-informative words in its eleventh topic. However, topic 11 in LDA11
actually captures an ambiguous, sparse distribution over the words. If we
adjust $\tau$ to be smaller than 0.3, however, LDA11 does capture the
non-informative words in one topic.



5 Empirical Study 

In this section, we analyze three corpora to compare vsLDA with two variants
of LDA using various evaluation metrics. The first two are abstracts collected
from the proceedings of the ACM SigGraph conferences (SigGraph) and the
proceedings of the NIPS conferences (NIPS), and the third dataset is from the
five comp subcategories of the 20 newsgroup corpus (20NG). To show the
performance of vsLDA in diverse settings, we test NIPS with stopwords kept
and SigGraph and 20NG with stopwords removed. The detailed statistics of
the three datasets are in Table 1. We compare three models: vsLDA, LDA
with asymmetric priors for $\theta$ (asymLDA), and LDA with symmetric priors for
$\theta$ (symLDA). Each model was run five times where each run includes 5,000
Figure 3: Scatter plot of ctf-idf versus relative df (rdf = df / total # of
documents) for the I words (red square) and the NI words (blue circle) inferred
by vsLDA with K = 50 for each corpus. Generally, I words tend to have higher
ctf-idf and lower rdf than the NI words.



iterations with 3,000 burn-in samples and 100 iterations used as a thinning 
interval. Other parameters were optimized in the same way as the previous 
section. 

Characteristics of informative and non-informative words. We
first describe the summary statistics of the informative (I) and the
non-informative (NI) words found by vsLDA and explain the interesting patterns
found. Figure 2 shows how many words are found to be NI as we vary K, the
number of topics. For NIPS and SigGraph, the number of NI words does not
vary for K of 25, 50, 75, and 100. For 20NG, the number noticeably decreases
as K increases. Further investigation is needed to explain this phenomenon,
but as Table 3 shows, the log likelihood of held-out data follows a similar
trend, so one conjecture is that the optimal number of topics is related to
the number of NI words for a given corpus. The proportion and the absolute
number of NI words clearly differ across corpora. On average, vsLDA
categorizes around 38%, 70%, and 86% of the words as NI for 20NG, SigGraph,
and NIPS, respectively.

To compare the characteristics of the NI and the I words, we compute three
summary statistics of the words: (1) corpus term frequency (freq), (2) document
frequency (df), and (3) corpus tf-idf (ctf-idf). Table 2 shows these statistics
for ten words from 20NG and ten words from NIPS. To examine whether any of the
statistics are associated with word informativity, we ordered the words in
decreasing order of freq for 20NG and of ctf-idf for NIPS. Since there is no
systematic pattern in the distribution of I and NI words in either ordering,
we conclude that no single statistic is sufficient to distinguish the two
classes of words inferred by vsLDA.

However, we found that ctf-idf can be useful for quantifying word
informativity when combined with the relative df (rdf = df / total # of
documents). Figure 3 shows a scatter plot of ctf-idf versus rdf for the I words
(red square) and the NI words (blue circle) inferred by vsLDA. In particular,
Figure 3(c) is a close-up
(a) 20NG

word           freq    df      ctf-idf   category

subject        1,855   1,715   1.08      NI
re             970     915     1.06      NI
windows        918     356     2.58      I
writes         822     653     1.26      NI
file           766     206     3.72      I
article        686     537     1.28      NI
don't          597     394     1.52      NI
SCSI           592     89      6.65      I
program        582     241     2.41      NI
drive          569     199     2.86      I

(b) NIPS

word           freq    df      ctf-idf   category

the            6,764   1,524   4.44      NI
of             4,849   1,486   3.26      NI
speech         124     71      1.75      I
localization   19      11      1.73      I
is             1,791   1,042   1.72      NI
learning       647     397     1.63      NI
recurrent      77      58      1.33      I
hidden         132     106     1.25      I
feature        66      53      1.25      NI
can            493     396     1.24      NI

Table 2: Basic statistics for the I words and the NI words inferred by vsLDA
with K = 50 for 20NG and NIPS. The words are ordered decreasingly by freq
(20NG) and by ctf-idf (NIPS), and neither ordering shows a systematic pattern
of word informativity.



of the lower-left corner where most of the words are located for 20NG. As shown
in Figure 3, the I words tend to show higher ctf-idf and lower rdf than the NI
words, suggesting that the I words are the ones that appear in a few documents
(low rdf) but with high frequency (high ctf-idf). For SigGraph with K = 50, the
average ctf-idf of I words is 2.26 and the average ctf-idf of NI words is 1.27.
The other corpora at all levels of K show the same pattern. As shown in Figure
3(c), many of the words have low rdf, and the classification of these words
mainly depends on whether their ctf-idf is high or low.
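
The word statistics discussed above can be computed with a short helper. The text does not spell out the formula for ctf-idf; the sketch below assumes it is the ratio freq / df, which reproduces the values listed in Table 2 to two decimals (for example, 918 / 356 ≈ 2.58 for "windows"). The function name `word_statistics` and the input format (a list of tokenized documents) are likewise assumptions of this sketch.

```python
from collections import Counter

def word_statistics(docs):
    """Return {word: (freq, df, rdf, ctf_idf)} for a list of tokenized documents.

    freq    : number of tokens of the word in the whole corpus
    df      : number of documents containing the word
    rdf     : df divided by the total number of documents
    ctf_idf : freq / df (assumed definition, see the lead-in)
    """
    freq = Counter(w for doc in docs for w in doc)
    df = Counter(w for doc in docs for w in set(doc))
    n_docs = len(docs)
    return {w: (freq[w], df[w], df[w] / n_docs, freq[w] / df[w]) for w in freq}
```

Under this reading, a word scores high on ctf-idf when its occurrences are concentrated in the documents that contain it, which matches the description of I words as appearing in few documents but with high frequency.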

In addition, Table 2 shows that the words normally categorized as stop
words, such as "the" and "is", are correctly identified as NI, as are the words
that do not distinguish topics, such as "learning" and "feature" in NIPS. We
also found that the NIPS corpus contains 284 stopwords, and vsLDA identified
91% of them on average as NI words.

Held-out likelihood. vsLDA divides the vocabulary into I and NI, two
mutually exclusive sets that are unpredictable given the basic word statistics.
Now, we describe the performance of vsLDA using held-out likelihood, which
measures the model's predictive performance for an unseen document based
on the trained parameters. We split the corpus into a training set containing
90% of the documents and a test set containing the rest. We compute held-out
likelihoods using a left-to-right style sampler [21] with maximum a posteriori
(MAP) estimators of the parameters $\phi$, $\psi$, $s$, and $\tau$. The average word
likelihoods are shown in Table 3, and these results are consistent with the
values reported in other studies of LDA with symmetric priors [3] and LDA with
asymmetric priors [20]. Overall, the held-out likelihoods of vsLDA are higher
than those of symLDA and comparable to those of asymLDA. It is worth noting
again that vsLDA excludes the NI words, which make up 40% to 80% of the
vocabulary, from the topics, and it still performs comparably to asymLDA,
which uses all of the words for the topics. These results suggest that
including the NI words in forming the topics does not contribute to the
predictive power of the model. To further test vsLDA, we manually set the stop
words as the non-informative words, trained the model (stopword-vsLDA), and
computed held-out likelihoods. These held-out likelihoods were lower than
those of vsLDA and asymLDA. These results indicate that variable selection
should be done within the model in combination with the topics, rather than as
a preprocessing step.

Figure 4: Average classification accuracies using 20NG. vsLDA outperforms
symLDA and asymLDA on document classification.

Classification. A topic model can be used for dimensionality reduction
because it expresses each document as a finite mixture of topics. One way to
verify the performance of a topic model is to perform classification tasks
using these reduced dimensions [2] [13]. We use the five subcategories of the
20NG dataset and classify the documents into the subcategories. We use the
libSVM toolkit with linear kernels, performing one-vs-all classification on
each category with ten-fold cross validation. Figure 4 shows the average
accuracies of the classification results. Overall, vsLDA performs better than
the other models, although the difference between vsLDA and asymLDA is small.
These results suggest that vsLDA reduces each document into more
discriminating subdimensions by excluding the non-informative words.

Similarity between multiple MCMC outputs with best matching
algorithm. LDA with asymmetric priors tends to generate highly skewed
distributions [20], where the model captures several major topics well, but
(a) 20ng.comp

K     vsLDA   asymLDA   symLDA

25    -7.04   -7.00     -7.35
50    -6.94   -6.87     -7.37
75    -6.88   -6.82     -7.3?
100   -6.84   -6.78     -7.36

(b) SigGraph

K     vsLDA   asymLDA   symLDA

25    -7.11   -7.04     -7.21
50    -7.09   -7.02     -7.25
75    -7.07   -7.00     -7.25
100   -7.06   -6.99     -7.27

(c) NIPS

K     vsLDA   asymLDA   symLDA   stopword-vsLDA*

25    -6.28   -6.25     -6.34    -6.32
50    -6.28   -6.25     -6.43    -6.34
75    -6.28   -6.25     -6.40    -6.34
100   -6.28   -6.25     -6.43    -6.35

Table 3: $\log P(W^{\text{test}} \mid W) / N^{\text{test}}$ for various values of K for the three corpora.
vsLDA performs comparably to asymLDA. For stopword-vsLDA, we manually
set the NI words to be the stopwords. stopword-vsLDA performs comparably to
symLDA but worse than vsLDA.

the other topics may be highly inconsistent over multiple MCMC outputs. In
the experiments presented here, for instance, five major topics occupy more
than 50% of word tokens in the corpus. This may pose a problem for cases
where the inferred topics $\phi$, not just the topic assignments for the word
tokens, are important. Variation of information (VI) is one metric to evaluate
the performance of clustering [TBJ [SHI , but VI is based on mutual information
of the topic assignments of tokens, so the major topics of the asymmetric
models will dominate the VI metric, thereby masking the inconsistencies of the
minor topics.

In order to better measure the consistency of the model with respect to
the topics $\phi$, we propose a new similarity metric based on a best matching
algorithm. First, based on the K inferred maximum a posteriori (MAP) estimates
of $\phi$ for each MCMC output, we find the K best matching pairs that minimize
the sum of symmetric KL divergences using the Hungarian algorithm [6] [12]. If
the model generates consistent topics over multiple runs, then this minimized
sum of divergences will be small. Table 4 shows the average divergences
between the best matching pairs, and it shows that vsLDA finds more consistent
topics than asymLDA and results comparable to symLDA. The inconsistencies of
asymLDA can be attributed to the minor topics for which the corpus does not
exhibit regular word-topic patterns. Although vsLDA may also generate skewed
(a) K=50

          20ng.comp   NIPS   SigGraph

vsLDA     3.12        2.49   3.22
asymLDA   3.68        5.96   4.45
symLDA    2.74        2.21   2.68

(b) K=100

          20ng.comp   NIPS   SigGraph

vsLDA     3.68        2.48   3.21
asymLDA   4.40        7.38   5.77
symLDA    3.00        2.20   2.69

Table 4: Average symmetric KL divergence between best matching topic pairs.
vsLDA shows similar average divergences compared to symmetric LDA despite
its asymmetric priors.

distributions, vsLDA would use the NI category to exclude the words that do 
not exhibit clear topic patterns, so the resulting topics are more consistent and 
robust to the initializations of multiple MCMC runs. 
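
The best matching metric can be sketched with SciPy's `linear_sum_assignment`, which solves the same assignment problem as the Hungarian algorithm. This is an illustrative sketch, not the code used for Table 4: the function names are hypothetical, and the small constant `eps` added before taking logarithms is an assumption needed because vsLDA topics assign zero probability to NI words.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) with eps-smoothing (assumption)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def average_best_match_divergence(phi_a, phi_b):
    """Match the K topics of two runs one-to-one and average their divergences.

    phi_a, phi_b : (K, V) arrays whose rows are topic-word distributions.
    Returns the mean divergence over matched pairs and, for each topic of
    phi_a, the index of its matched topic in phi_b.
    """
    K = phi_a.shape[0]
    cost = np.array([[symmetric_kl(phi_a[i], phi_b[j]) for j in range(K)]
                     for i in range(K)])
    rows, cols = linear_sum_assignment(cost)  # minimizes the total cost
    return cost[rows, cols].mean(), cols
```

Because the matching is one-to-one, a run that permutes the topic labels of another run yields a divergence of zero, so the metric reflects differences in topic content rather than in topic ordering.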

We also measure the consistency of vsLDA by looking at the NI word sets
over multiple runs and computing Jaccard's coefficient, which measures the
degree of overlap of two sets by dividing the size of the intersection by the
size of the union. Although we do not know the 'ground truth' of the NI word
set, we certainly do not expect it to change for each run. The Jaccard's
coefficients for multiple MCMC runs are, on average, 0.83, 0.96, and 0.93 for
20NG, NIPS, and SigGraph, respectively, and these values indicate high
consistency over multiple runs.
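
For reference, Jaccard's coefficient between NI word sets can be computed as below. Averaging over all pairs of runs is an assumption of this sketch, since the text only states that the coefficients are averaged; the function names are illustrative.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical (convention chosen for this sketch).
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_jaccard(word_sets):
    """Average Jaccard coefficient over all pairs of runs (requires >= 2 sets)."""
    pairs = list(combinations(word_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

With five runs per model, as in the experiments described here, this averages over ten pairs of NI word sets.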

6 Discussion 

We developed a variable selection model for LDA which selects a subset of the 
vocabulary to better model the topics. We were motivated by the curiosity 
about the usual practice of using the entire vocabulary to model the topics and 
the ad-hoc nature of the preprocessing steps to reduce the vocabulary size. Our 
model, vsLDA, explicitly selects the non-informative words to exclude from the 
vocabulary, simultaneously with the inference of the latent topics. By only using 
the words that help, not hinder, the process of inferring the topics, our model 
combines the advantages of LDA with symmetric priors and LDA with
asymmetric priors. One future direction for vsLDA is to apply it to online
learning [TUl [24]. Typically, in an online learning situation, the vocabulary
grows as more data become available, but we cannot use the entire vocabulary
because it increases monotonically [14]. By using vsLDA, we can control the
effective size of the vocabulary. Also, vsLDA can be used for object
recognition, image segmentation [221 US] , or collaborative filtering [11] [15]
because vsLDA finds topics with more discriminative power. With vsLDA, we
showed one way of incorporating variable selection into LDA and improving the
results, so the natural next step would be to incorporate variable selection
into other topic models pj [13] HH [8] for improved results.